Induction Of Classification From Lexicon Expansion: Assigning Domain Tags To WordNet Entries

نویسندگان

  • Echa Chang
  • Chu-Ren Huang
  • Sur-Jin Ker
  • Changhua Yang
چکیده

We present in this paper a series of induced methods to assign domain tags to WordNet entries. Our prime objective is to enrich the contextual information in WordNet specific to each synset entry. By using the available lexical sources such as Far East Dictionary and the contextual information in WordNet itself, we can find a foundation upon which we can base our categorization. Next we further examine the similarity between common lexical taxonomy and the semantic hierarchy of WordNet. Based on this observation and the knowledge of other semantic relations we enlarge the coverage of our findings in a systematic way. Evaluation of the results shows that we achieved reasonable and satisfactory accuracy. We propose this as the first step of wordnet expansion into a bona fide semantic network linked to real-world knowledge. 0. Introduction1 WordNet is a lexicon comprising of nouns, verbs, adjectives and adverbs. Its basic 1 This research is partially funded by an IDLP project grant from the National Science Council of Taiwan, ROC. Work reported in this paper was carried out in summer 2001, during Chang's internship at Academia Sinica. We are indebted to two anonymous reviewers of SemaNet 2002, as well as from the First International WordNet Conference for their helpful comments. An earlier version of this paper was accepted by the first IWC but was not presented because of the authors' travelling difficulties at that time. We thank colleagues at Academia Sinica, especially Shu-Chuan Tseng, Keh-jiann Chen, and members of the WordNet group, for their input and help. organization is based on different semantic relations among the words. Entries (or lemas) sharing the same meaning is grouped into a synset and assigned with a unique sense identification number for easy retrieval and tracking purposes. This unique offset number gives the information about the parts of speech and the hierarchy position to which a specific synset belongs. For nouns and verbs the synsets are grouped into multiple lexical hierarchies; modifiers such as adjectives and adverbs are simply “organized into clusters on the basis of binary opposition (antinomy).” [1] This lexical hierarchy makes the lexical domain assigning task more straightforward because it coincides with a ontological taxonomy in many aspects. The primary objective of our project is to enrich the WordNet knowledge content due to the fact that “WordNet lacks relations between related concepts.” [2] We adopt WordNet itself, together with other lexical resources to develop an integrated domain specific lexical resource. 1. The Five Tagging Methods Starting with two lexical resources, we employed five steps to assign and expand domain tags. Basically, the explicit domain information from Far East Dictionary as well as WordNet's own hierarchy of semantic relation are used to extend the coverage of domain assignment. 1.1 Domain Data Lookup from Far East Dictionary The digital file of Far East Dictionary contains complete information for each word entry that can be found in an ordinary printed version. Most of all, it lists the domain information for each vocabulary wherever possible. Thus we employ the available data from a text source file (each vocabulary entry is organized as one single row) and extract all the information by running a string manipulation program coded in Visual Basic. During the extraction process we only take into account the part of speech of each word in Far East Dictionary. Next, we map the domains obtained from Far East Dictionary if the word and its part of speech coincides with the entries in our database which contains a complete list of synset. Since WordNet collects only nouns, verbs, adjectives and adverbs, we only extract the domain data that falls into these four categories. Later we group the information in a database table and extent the assigned domains of each word to its synset. Table 1 is an example of our database table which 'contains all the adverbial uses of `aback.' id term domain 00073303R aback aviation, 00073386R aback aviation, Table 1 Example of The Far East Dictionary Domain Database Table In Table 1, it is shown that 'aback' has two adverbial senses. Since in Far East Dictionary “aback” is labeled with domain 'aviation,' extra work of expansion is necessary to further label all of its adverb synset with the same domain to maintain the integrity of the information. Because both the extraction and expansion method would produce ambiguities in domain assignment, manual verifications are required in the future. 1.2 Extracting Domain Information from WordNet Sense Description Each WordNet entry (i.e. each synset) is followed by its sense. Although there is no specifically defined set of controlled vocabulary, the sense definition does specify the field the synset members are commonly used in that specific field of study, such as biology, physics or chemistry. This specification comes in a special format contained in a bracket for each WordNet entry so that extraction of data is possible and straightforward. Due to the fact that each domain is directly extracted by its corresponding synset, there is simply no ambiguity in assigning the domain tags. And if there is more than one lexical item in that synset, all will share the same domain tag. 1.3 Establishment of a Common Domain Taxonomy for Nouns Each lexical resource uses a different domain taxonomy, which may be explicitly defined or implicitly assumed. Hence, when combining domain information from multiple sources, the establishment of a Common Domain Taxonomy (CDT) is crucial for both efficient representation as well as effective knowledge merging. Our survey of existing domain taxonomy, including LDOCE, HowNet, Tongyici Cilin, etc., show that there is quite a lot in common. Hence we decide to build a working CDT based on the two resources we have. Note that since our goal is to establish a domain taxonomy for wordnets (for English now and for Chinese in the future), the existing domain information in WordNet need to be assumed as defaults that can be over-ridden. Hence a model of CDT based on basic binary combination involving WordNet is necessary. After collecting all the domain tags from the two resources, we build our CDT. First, all common domain nodes are put in a hierarchy based on their relation. Second, inconsistent domain names are resolved. Last, when gaps appear after all domain tags are attached to the taxonomy, new domain categories are adopted to fill in the gaps and make a more complete CDT. Since top taxonomy presupposes a particular view on conceptual primacy and may differ in different lexical sources, we took a bottom-up approach to our CDT. That is, right now each taxonomy tree now stops at some broad-consensus level without being committed to a higher taxonomy. The following is a partial list of our current CDT. Humanity Linguistics Rhetorical Device Literature History Archeology ... Social Science Sociology Statistics Economics Business Finance ... Formal Science Mathematics Geometry Algebra ... Natural Science Physics Nuclear Chemistry Biology Palaeontology Botany Animal Fish Bird ... Applied Science Medicine Anatomy Physiology Genetics Pharmacy Agriculture ... Fine Arts Painting Sculpture Architecture Music Drama ... Entertainment Sports Balls Track & Field Competition Game Board Card ... Proper Noun Name Geographical Name Country Religion Trademark ... Humanity Archaic Informal Slang Metaphor Formal Abbreviate ... Lexical Sources Latin Greece Spanish French American ... Please note that by induction and actual examples from the lexical organization in WordNet, it is found that a hyponym is very likely to belong to the same domain as its hypernym. Similar results are also found for wordnet based cross-lingual inference of lexical semantic relations [4]. For instance, under the term 'mathematics,' all the hyponyms below are related to this field of study. To make us of this lexical semantic phenomenon, we make a table of all the domain terms and map them to their unique WordNet sense identification number. Later we use the tree expansion method (discussed in more detail in Section 2.4) to trace down all the hyponyms. For example, by using this method, the hyponyms of Linguistics are all labeled as 'linguistics' and so forth. 1.4 Lexical Hierarchy Expansion of Nominal Domain Assignment WordNet has is a lexical semantic hierarchy linking all synsets with lexical semantic relations. We convert all the relations to a database in a relational table, as shown in Table 2 [1]: Hypernym ID Hyponym ID Relation 00001740A 04349777N = 00001740A 00002062A ! 00001740N 00002086N ~ ... ... ... Table 2 Lexical Relation Table The relation symbols in Table 2 are adopted from the WordNet database files. These symbols are saved with each synset entry to indicate a specific semantic relation with other synsets. The implemented information allows us to trace and locate all the related synsets.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Characterization of Wordnet Features in Boolean Models For Text Classification

Supervised text classification is the task of automatically assigning a category label to a previously unlabeled text document. We start with a collection of pre-labeled examples whose assigned categories are used to build a predictive model for each category. In previous research, incorporating semantic features from the WordNet lexical database is one of many approaches that have been tried t...

متن کامل

Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology

Due to the increasing web, there are many challenges to establish a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, But the problem ...

متن کامل

Using Multilingual Resources for Building SloWNet Faster

This project report presents the results of an approach in which synsets for Slovene wordnet were induced automatically from parallel corpora and already existing wordnets. First, multilingual lexicons were obtained from word-aligned corpora and compared to the wordnets in various languages in order to disambiguate lexicon entries. Then appropriate synset ids were attached to Slovene entries fr...

متن کامل

Extending NomLex-PT using AnCora-Nom

This work describes how we used AnCora-Nom, a Spanish nominalization lexicon, to extend NomLex-PT, a lexical resource for Portuguese, originally based on the English NomLex lexicon and fully integrated to OpenWordNet-PT, our freely available Portuguese WordNet. The complete Spanish lexicon, which contains 1,655 entries, was translated to Portuguese and then compared to our previous data. Furthe...

متن کامل

Automatic Acquisition of Domain Specific Lexicons

In this paper we present the results of three years of experiments about automatic acquisition of domain specific terminology from corpora. We present an analysis of the potentiality and limitations of the Term Categorization approach to lexical acquisition, and we propose a novel methodology to approach the task, consisting on applying Latent Semantic Kernels to estimate term similarity. We fi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002